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' We apply the general protocol of parameter optimization (Lee, J. et al. Phys. Chem. B 2001, 

\ 105, 7291) to the UNRES potential. In contrast to the earlier works where only the relative weights 

^SJ ■ of various interaction terms were optimized, we optimize all linear parameters of the potential. The 

method exploits the high efficiency of the conformal space annealing method in finding distinct 
', low energy conformations. For a given training set of proteins, the parameters are modified to 

■ make the native-like conformations energetically more favorable than the non-native ones. Linear 

approximation is used to estimate the energy change due to the parameter modification. The 
parameter change is followed by local energy reminimization and new conformational searches to find 
the energies of native-like and non-native local minima of the energy function with new parameters. 
These steps are repeated until the potential predicts a native-like conformation as one of the low 
. energy conformations for each protein in the training set. We consider a training set of crambin 

Q ' (PDB ID lejg), Ifsd, and the 10-55 residue fragment of staphylococcal protein A (PDB ID Ibdd). 

C/3 ^ As the first check for the feasibility of our protocol, we optimize the parameters separately for 

these proteins and find an optimal set of parameters for each of them. Next we apply the method 
simultaneously to these three proteins. By refining all linear parameters, we obtain an optimal set 
of parameters from which the native-like conformations of the all three proteins are retrieved as the 
global minima, wtthout introducing additional multi-body energy terms. 
* To whom the correspondence should be addressed. E-mail: jlee@kias.re.kr. 
Q 1 '1^ Present address: Electronics and Telecommunications Research Institute, Daejon, 305-350, Korea. 

, ■ I. INTRODUCTION 

' The prediction of the three dimensional structure of a protein solely from its amino acid sequence is one of the most 
OO : challenging problems in computational sciences today. Popular approaches to this problem have been comparative 
modeling and fold recognition, which can be classified as knowledge-based methods.^""' These methods use statistical 
relationship between the sequences and the three-dimensional structures of the proteins in the Protein Data Bank 
(PDB) in order to predict the unknown structure of a protein sequence, without deep understanding of the protein 
folding. 

On the other hand, the ab initio method, ^"^^ which is also called the energy-based or physics-based method, is 
' based on the thermodynamic hypothesis which postulates that proteins adopt native structures that minimize their 
^ , free energies. Since it attempts to understand the fundamental principles of the protein folding itself, the success 
I of this method will lead not only to the successful structure prediction, but also to the clarification of the protein 
' ^ , folding mechanism. 

However, there have been two major obstacles to the successful application of energy-based methods to the protein 
folding problem. First, the energy landscape of a protein is riddled with an astronomical number of local minima, 
making it difficult to search. Secondly, there are inherent inaccuracies in potential energy functions which attempt to 
describe the energetics of proteins. The first problem has been largely alleviated to some extent by recent developments 
of efficient search algorithms such as the conformational space annealing (CSA) method. ^^"^^ The second problem 
is the one which is addressed in this paper. The accuracy of a given potential energy function can be improved by 
5^ , modifying its functional form as well as its parameters. In this work, we will refine the parameters of the potential 
energy without changing the functional form. 

Physics-based potentials are generally parameterized from quantum mechanical calculations and experimental data 
on model systems.^® However, such calculations and data do not determine the parameters with perfect accuracy. 
The residual errors in potential energy functions may have significant effects on simulations of macromolecules such 
as proteins where the total energy is the sum of a large number of interaction terms. Moreover, these terms are known 
to cancel each other to a high degree, making their systematic errors even more significant. Thus it is crucial to refine 
the parameters of a potential energy function before it is applied to the protein folding problem. 

In fact, an iterative procedure which systematically refines the parameters of a given potential energy function was 
presented by Lee et a\}^ Since the CSA method can efficiently sample a wide range of the conformational space of a 
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protein, the strategy is to apply this method to the proteins with known structures in order to refine the potential. 
We refine the parameters so that native- like conformations of these proteins have lower energies than non-native ones. 
The set of the proteins used for the parameter refinement is called the training set. It would be desirable to include 
many proteins in the training set, which belong to representative structural classes of proteins. However, it is quite a 
non-trivial issue to check whether this procedure itself is feasible, even for a small number of proteins in the training 
set. 

The 10-55 fragment of staphylococcal protein A (Ibdd) was used by Lee et al.^^ to refine a coarse grained potential 
called the UNRES potential, ^''"^^ where each residue is approximated by two interaction sites. This potential was 
successfully applied for predicting the unknown structures of proteins in CASP3,^' and its basic version consists 
of seven interaction terms. Lee et al.^^ optimized six relative weights of these interaction terms. Since only an a 
protein was used for the refinement, the resulting potential was suitable for a proteins. 

On the other hand, Pillardy et al.^"'^ used three training sets consisting of one protein Ipou, one protein Itpm, and 
two proteins betanova and Ibdd, to optimize the potentials for predicting a, 13, and a / j3 proteins, respectively.^^ This 
potential was extended to include six additional multi-body terms, which increased the number of relative weights to 
be optimized from six to twelve.^'' Therefore in these works, the functional form of the UNRES potential as well as 
a part of the parameters was modified, which are the twelve relative weights. The introduction of the six additional 
multi-body terms were necessary in order to incorporate proteins with /3 strands. 

However, one should note that each of the seven interaction terms in the original version of the UNRES potential 
has its own parameters in it. Therefore it is a natural question to ask whether one can optimize the potential energy 
function for the proteins with /3-strands by refining these parameters, without introducing additional multi-body 
energy terms. Of course, it might not be possible to optimize the potential for arbitrarily many proteins without 
introducing additional interaction terms. However, it is important to optimize the parameters as much as possible 
before introducing functional modifications, since this will give us better insights on the limitations of a given potential 
and the types of additional interaction terms necessary for improvement. 

Indeed we observe that it is possible to refine the UNRES potential with three proteins Ibdd, Ifsd, and crambin 
(lejg) without introducing additional multi-body terms, where proteins Ifsd and lejg contain (3 strands. First, the 
parameters are optimized separately for these proteins, and an optimal parameter set for each of them is obtained. 
The potentials with optimized parameter sets yield global minimum energy conformations (GMECs) with root mean 
square deviations (RMSDs) of 1.7, 2.5, and 2.6 A ffrom the native structures, for Ibdd, Ifsd, and lejg respectively. 
Finally the parameters are refined for the training set consisting of these three proteins, and a parameter set is 
obtained which correctly describe the energetics of these proteins simultaneously. The potential with the optimized 
parameter set yields the GMECs with RMSDs of 1.8, 2.5, and 2.6 A, respectively. 

II. METHODS 
A. General Protocol 

A brief description of our procedure is as follows. In order to check the performance of a potential energy function 
for a given set of parameters, one has to sample native- like conformations as well as non- native ones. Non- native 
conformations can be obtained by an unrestricted conformational search which we call global CSA. Native-like con- 
formations are obtained by a restricted search which we call local CSA. In the local CSA, only the conformations 
whose RMSDs from the native structure are below a preset cutoff value, are sampled. 

Since a potential can be considered to describe the nature correctly if native-like structures have lower energies than 
the non-native ones, the optimization criterion is given in terms of the energy gap, which is the difference between 
the lowest energy of the native- like conformations and that of the non- native ones. We define the energy gap to 
be negative when the lowest energy of the native conformations is lower than that of non-native ones. We modify 
the parameters so that the energy gaps of the proteins in the training set decrease. The changes of energy gaps are 
estimated by the linear approximation of the potential in terms of parameters (See section D.). Since the positions 
of the local energy minima are shifted due to the parameter modification, it is necessary to reminimize their energies 
with the new parameters. We also search the conformational space with the newly obtained parameters to find new 
low- lying local energy minima. Together with the energy-reminimized conformations, these constitute a structural 
database which will be used for subsequent refinement of the parameters. We iterate these steps until the energy gaps 
become all negative for proteins in the training set. The detailed explanation for each step of our procedure is given 
below. 
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B. Potential Energy Function 



We use the UNRES force field/^"^^ where a polypeptide chain is represented by a sequence of a-carbon (C") atoms 
linked by virtual bonds with attached united side chains (SC) and united peptide groups (p) located in the middle 
between the consecutive C"'s. All the virtual bond lengths are fixed; the C"-C" distance is taken as 3.8 A, and C"-SC 
distances are given for each amino acid type. The energy of the chain is given by 

E = ^Uscsc{i,j) +'uJscp^Uscp{i,j) +'Wpp ^ Upp{i,j) + Wb^Ub{i) 

i<j ijtj i<j-l i 

+Wtoi ^ Utoi{i) + Wrot ^ C^rot(i) + W'dist^dis + w'^t-loc X] ^il-loc(*' j)' (1) 
i i i<j 

where w's are the relative weights which were refined in the earlier works.^®'^-"-'^^ As described in detail in the 
appendix, J7scsCi C^scpi C^ppj CAor, and uj,f\^^ can be further decomposed into linear combinations of smaller parts, 

whose coefficients are refined in this work. Therefore we may fix the values of wscp; w^pp; w'tor, and w^g^lio^. without loss 
of generality. We set them to unity for simplicity. Here, Uscsc{i, j) represents the mean free energy of the hydrophobic 
(hydrophilic) interaction between the side chains of residues i and j, which is expressed by Lennard- Jones potential, 
Uscpihj) corresponds to the excluded- volume interaction between the side chain of residue i and the peptide group of 
residue j, and the potential Upp{i,j) accounts for the electrostatic interaction between the peptide groups of residues 
i and j. The terms Utoi{i), Ub{i), and Urot{i), denote the short-range interactions, corresponding to the energies of 
virtual dihedral angle torsions, virtual angle bending, and side chain rotamers, respectively. Udis denotes the energy 
term which forces two cystein residues to form a disulfide bridge. Finally, the four-body interaction term U^^_^^^ 
results from the cumulant expansion of the restricted free energy of the polypeptide chain. In contrast to the earlier 
works^^'^^ where additional multi-body terms were introduced,^® ^ei^-ioc only multi-body term used in this 

work. The detailed forms of these terms are given in the appendix. As discussed there, the total number of linear 
parameters which we adjust is 709. The functional form Eq.(l), as well as the initial parameter set we use, is the one 
used in the CASP3 exercise.'^' ^° 



C. Global and Local CSA 

In the protein folding problem, the energy surface contains an astronomical number of local energy minima. The 
larger a protein is, the more likely it is that there exist many local energy minima which correspond to very different 
structures. In general, it is not sufficient to consider only the lowest energy conformation as a possible candidate 
for the native structure. Since the force field parameters contain inevitable errors, one should take account of many 
distinct low energy conformations. Therefore it is necessary to search the whole conformational space. 

It has been shown that this multiple minima problem can be overcome by an efficient search algorithm such as the 
CSA method. In this work, extensive conformational searches are carried out by global and local CSA methods. "'^^""'^^ 
The CSA method can be considered as a genetic algorithm that enforces a broad sampling in its early stages and 
gradually allows the conformational search to be focused into narrow conformational space in its later stages. As a 
consequence, many low-energy local minima including the GMEC of the benchmark protein can be identified for a 
given parameter set. Unless the parameters are properly optimized, these conformations can be quite different from 
the native structure. Therefore, in this case, we may consider the global CSA as the sampling of the non-native 
conformations. On the other hand the native-like conformations are sampled by the local CSA search. The local 
CSA is the restricted search where only the conformations whose C" RMSD values are within a fixed cutoff, Rc, of 
the native conformation, are sampled. Also, in order to find these native-like structures, the initial conformations are 
prepared with the native backbone coordinates, whose energy is subsequently minimized. The value of Rc should 
be large enough to sample representative native-like conformations and at the same time small enough to eliminate 
non-native conformations. 



D. Linear Approximation and Parameter Refinement 

Once the energies of the non-native and native-like conformations for all proteins in the training set are obtained, 
the parameters are modified as follows. We select the protein with the largest energy gap, and change the parameters 
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so that this energy gap decreases. The parameters are changed by small amounts at each step, so the energy with the 
new parameters can be estimated by the linear approximation: 



where the p° s and pf°^'s represent the parameters before and after modification, respectively. The parameter 
dependence of the position of the local minimum can be neglected in the linear approximation, since the derivative in 
the conformational space vanishes at a local minimum. In general the derivative ^ is a function of the parameters, 
but for linear parameters it is just a constant independent of parameters. In this work, we adjust only the linear 
parameters for simplicity, the total number of them being 709 for the UNRES potential. The details can be found in 
the appendix. Therefore the energy function can be written as: 



E = Y^PiCi. (3) 

i 

where e^'s are the coefficients independent oi Pi. The change of the energy gap is estimated as: 

AiJgap = E,,p{{pr}) - i^gap({pf }) 

_ ^^(lowest Nj^jpiiewj^-j _ ^(lowest NN) ^^^newj-j-j _ ^^(lowest N)^^^old|-^ _ ^(lowest NN) j-i^oldj-j^ 

= ^[ei(lowest N) - [ei(lowest NN)](pr^ - pf'^). (4) 

i 

where E and e arc evaluated for the lowest energy native-like (N) and non-native (NN) conformations. We fix the 
magnitude of the parameter change Spi = p^^"" —pf'^ to be a certain fraction a of pf'^. We use a = 0.01 in this study. 
The sign of 5pi is chosen to decrease the energy gap, 

5pi = -apf'^sign[e^{\owcsi N) - e,; (lowest NN)]. (5) 



We repeat this procedure of selecting the protein with the largest energy gap and modifying the parameters, until the 
energy gaps estimated by Eq.(4) become all negative for proteins in the training set. The flow chart for this part of 
the algorithm is shown in Fig. 1. 



E. Reminimization and new conformational search 



Since the procedure of the previous section was based on the linear approximation Eq.(4) and the number of confor- 
mations in the structural database is limited, we now have to evaluate the true energy gap using the newly obtained 
parameters. The breakdown of the parameter refinement may come from two sources. First, the conformations corre- 
sponding to the local minima of the potential for the original set of parameters are no longer necessarily so for the new 
parameter set. For this reason, we reminimize the energy of these conformations with the new parameters. Secondly, 
the local minima obtained using CSA method with the original parameter set are only a tiny fraction of the whole 
set of local minima. After the change of the parameters, some of the local minima which were not considered due to 
their relatively high energies, can now have low energies for the new parameter set. It is even possible that entirely 
new low-energy local minima appear. Therefore these new minima are taken into account by performing subsequent 
CSA searches with the newly obtained parameter set. 



F. Update of the structural database and iterative refinement of parameters 

The low-lying local energy minima found in the new conformational searches are added to the energy-reminimized 
conformations to form a structural database of local energy minima. The conformations in the database are used to 
obtain the energy gaps, and if their values are not satisfactory, these conformations are used for the new round of 
parameter refinement. As the procedure of [CSA parameter refinement — ^ energy reminimization] is repeated, the 
number of conformations in the structural database increases. As an example, the energy-RMSD plot of an energy- 
reminimized structural database for the protein lejg is shown in Fig. 5(c). The flow chart of the whole procedure is 
illustrated in Fig. 2. This iterative procedure is continued until the energy gaps become negative for all proteins. 
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G. Choice of RMSD cutoff 



It is important to choose the RMSD cutoff judiciously for each protein in the training set, in order to carry 
out the whole procedure efficiently. This cutoff is the criterion for distinguishing the native-like and non-native 
conformations. The cutoff is necessary in two places in the procedure. First, it is used in the local CSA where 
conformations with RMSDs below a preset cutoff value are sampled, and secondly, in the parameter refinement step 
where the conformations in the structural database are divided into native-like and non-native families. In general, 
these two cutoff values can be different from each other. In addition, a separate cutoff value can be used for each 
iteration. In this work we check the distribution of RMSD vs. energy of the conformations by visual inspection, and 
cluster them into native-like and non-native families to determine the appropriate values of RMSD cutoff for each 
protein. The values of RMSD cutoff used are given in table I and II. 

III. RESULTS 

We consider a training set consisting of three proteins. They are the 10-55 fragment of the B-domain of staphy- 
lococcal protein A (Ibdd), Ifsd, and crambin (lejg), which are 46, 28, 46 residues long respectively. We first refine 
parameters for these proteins separately, to check whether our protocol for the iterative parameter refinement is 
feasible. 



A. Separate parameter refinement for each protein 

This is the simplest case of having one protein in the training set. The first example is the Ibdd, which was the 
target protein of the previous study of the weight optimization.^^ In that work, a negative energy gap was found after 
6 iterations, and the GMEC which has a 2.2 A RMSD deviation from native structure was obtained. We start with 
the same initial parameters, which were used in CASP3.''' 

In the CSA sampling with the original parameter set, the GMEC was of 3.8 A RMSD from the native structure,^ 
as shown in Fig. 3(a). We set Rc = 3.0 A , adjust the parameters according to it, and proceed to the next iteration. 
A negative energy gap is found after 3 iterations of parameter refinement with Rc = 2.2 A and the global CSA search 
yields the global minimum at 2.2 A. Furthermore, from the local CSA run we find conformations with RMSDs lower 
than 2.0 A (Fig. 3(b)). Therefore we repeat the procedure with lower values of Rc- After 11 iterations the GMEC 
with RMSD= 1.7 A is obtained (Fig. 3(c)). We have further proceeded with an even lower value of Rc- However, the 
energy gap does not improve after the 11-th iteration. Therefore we take the result from the 11-th iteration as the 
final optimized parameter set for this protein. The results are shown in Fig. 3(c). Similar procedures are repeated for 
Ifsd and lejg to obtain optimized parameter sets which yield GMECs with RMSD values 2.5 A and 2.6 A respectively. 
Details are shown in tables I, II and figures 3, 4, 5. 



B. Simultaneous psirameter refinement for three proteins 

Again, the initial parameter set is the one used in CASP3.'''^'' Since a large number of conformations were already 
accumulated in the structural databases during the separate parameter optimizations for three proteins, we use them 
to start the iterative procedure of simultaneous parameter refinement for the three proteins. 

By choosing appropriate values of RMSD cutoff for each iteration, we obtain an optimized parameter set after 6 
iterations, yielding the GMECs with RMSD values of 1.8 A, 2.5 A, and 2.6 A for Ibdd, Ifsd, and lejg, respectively. 
The energies and RMSDs of the conformations obtained with the optimized parameters are plotted in Fig. 6, and 
the C" trace of the GMEC conformations are shown in Fig. 7 along with the native conformations. The numerical 
values of the optimized parameters are provided in the Supporting Information. As an example of the changes of 
interaction terms due to the parameter optimization, the torsional energies between residues which are neither glycine 
nor proline, with the optimized and original parameter set, are plotted in Fig. 8. 

IV. JACKKNIFE TEST 

It should be noted that the purpose of the present work is not to provide a potential which is transferable to 
all proteins, but to develop a methodology for optimizing potential parameters of a given potential. Applying this 
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method to develop a transferable potential is out of the scope of this paper, and a much larger training set and even 
additional interaction terms might be necessary in order to achieve it. In fact, it is quite nontrivial to check whether 
such a procedure is possible at all. However, we performed conformational searches for proteins not included in the 
training set, which is called a Jackknife test, and find some interesting features. 

It should be noted that a mere comparison of the RMSDs of low energy conformations found from the optimized 
parameters with the native structure is not so meaningful. Rather, we should check if the low energy conformation 
from the new parameters are closer to the native structure in comparison to those from the original parameters. 
We considered the 1-32 segment of the 36-residue protein Ibba. This protein contains a C-terminal a-helix with 
an N-terminal extended strand parallel to the helix. The NMR structure of the protein is shown in fig. 9 (a). Using 
CSA, 200 low energy conformations are sampled for both the original parameters and the optimized parameters, 
respectively. The lowest RMSD values are 6.2 A and 5.8 A for the original and optimized parameters respectively, 
whereas the GMECs' RMSD values are 7.6 A and 7.9 A respectively. Although the RMSD values are rather large, 
interesting qualitative differences in the secondary structures of the sampled conformations are observed, which is 
difficult to be recognized by RMSD values alone. The lowest RMSD conformation and GMEC for the optimized 
parameters are shown in fig. 9(b) and (c), and those for the original parameters are shown in fig. 9(d) and (e). We 
observe that, for conformations from the optimized parameters, the a-helices extend to the end of the C-terminal, 
which is in good agreement with the native structure. On the other hand, the corresponding a-helices are incomplete 
near the C-terminal for the conformations obtained with the original parameters. In addition, the extended strand at 
the N-terminal is reproduced better with the optimized parameters. We find these are the prevailing features of all 
400 conformations we have sampled. 

We also performed Jackknife tests on other proteins, whose results are not shown here. For some a/ (3 proteins, 
notably 1L4V, the resulting conformations have similar qualitative features as above; i.e. the optimized parameters 
perform better in assigning secondary structures. On the other hand, the results for pure a or /3 proteins are not so 
conclusive. 



V. CONCLUSION AND DISCUSSION 

We applied the general protocol for the force field parameter optimization of Lee et al.^^ to the UNRES potential 
used in CASP3.^'^" We optimized the parameters separately for the 10-55 fragment of staphylococcal protein A 
(Ibdd), Ifsd, and crambin (lejg), and could obtain an optimal parameter set for each of them, giving the GMEC 
with RMSD value of 1.7, 2.5, 2.6 A rrespectively. We could also obtain a parameter set which correctly describes the 
energetics of these three proteins simultaneously. This optimized parameter set yielded GMECs with RMSD values 
of 1.8, 2.5, and 2.6 A for Ibdd, Ifsd, and lejg. 

In contrast to the earlier works^^'^^ where only relative weights were optimized, we refined all 709 linear parameters 
of the UNRES potential. This enabled us to optimize the UNRES potential of Eq.(l) without introducing additional 
multi-body energy terms. In particular, it is demonstrated for the first time that the energetics of proteins containing 
(3 strands can be correctly described using the energy terms in Eq.(l) only. 

It would be interesting to see how many proteins can be energetically well described using a given force field. 
This should provide a good measure for the efficacy of existing force fields. Once the parameters for a potential is 
successfully refined for the proteins in a given training set, we should perform a Jackknife test on proteins not included 
in the training set. If this test is successful, we may confidently use this potential for predicting the unknown structure 
of a given amino acid sequence. 

Before tackling these more challenging problems, there are several points in our protocol which should be improved. 
First of all, in the step of parameter refinement, we decreased the largest among the energy gaps of the proteins 
without any restriction, and repeated this procedure. However this can become quite inefficient as the number of 
proteins in the training set increases. We need to implement a constrained optimization where we require that the 
energy gaps of other proteins in the training set do not increase while that of a given protein decreases. Secondly, the 
value of RMSD cutoff at each iteration were determined from visual inspection. It would be better if one can devise a 
natural criterion for choosing RMSD cutoffs. Thirdly, in principle, one can also refine nonlinear parameters, which was 
not carried out in this work. Finally, although wc considered only the UNRES potential for parameter optimization 
in this work, it is straightforward to apply the procedure to other potentials such as ECEPP,-^* AMBER,-^^ and 
CHARMM^*^ with various solvation terms. All these points are left for the future study. 
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APPENDIX A: THE UNRES POTENTIAL AND ITS LINEAR PARAMETERS. 

The united residue(UNRES) potential is given by the expression: ^''"^^ 

E = ^Uscsc{-i,j)+'wscp^Uscp{-i,j)+u!pp ^ Upp{i,j) + Wb^Ub{i) 

i<j ijtj i<j-l i 

+Wtoi ^ Utoi{i) + Wrot ^ Urot{i) + ^dist^dis + ^"el'-loc f^il-loc(^' j)' (^1) 
i i i<j 



1. Side-Chain Interactions 



The interactions between the side-chains are given by the Lcnnard- Jones type potential; 

t/SCSc(t,j) = + -6 ' 

ij ij 

ti = 1, • • • 20 being the amino acid type of the i-th residue. The linear parameters we optimize in this work are ascsc 
and 6scsc which comprise a total of . 2 = 420 parameters. 



2. Peptide-Peptide Interaction 



The peptide-peptide interaction is decomposed into Lennard-Jones type interaction and electrostatic interaction: 

Upp{i,j) = C/Lj(i,i) + t/es(i, j) (A3) 

with the Lennard-Jones type interaction 

Ulj{i,J)- -js + -g (A4j 



and electrostatic interaction 



Ues{i,j) = "'''^'^g'"^'^'* [4 -I- (cos aij - 3 cos /?» j cos 7^ ^ ) ^ - 3 (cos^ Pij + cos^ jij ) ] 

+ ^^^^"^3' ^■'^ (cos aij — 3 cos f3ij cos 7,^- ) ( A5) 

'''ij 



where cosajj = {rii ■ rij) , cos (3ij — {rii -rji), cos 'jij = {nj-rji), and Uj is the vector along the i-th peptide. The integer 
li denotes the type of peptide group i. There are only two types, proline and non-proline, with li = 1,2. The term 
with J = i -|- 1 is not inchidcd in the summation of (Al) since it can bo absorbed into the local energy terms such as 
the bending energy. The linear parameters we adjust for this interaction are Upp, bpp, a^i, and b^i, which comprise a 
total of 4 • ^ = 12 parameters. 
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3. Side-Chain Peptide Interaction 



The side-chain peptide interaction is given by a Lennard-Jones type interaction: 

TT i~ ,^ _ a'SCpih,tj) , bscp{Ii,tj) 

Uscp{i,J)- -12 + 16 (^6) 

ij ij 

Again, j = ?' ± 1 is not included in the summation of (Al) since it can be absorbed into the local rotamer energy. 
Therefore the nearest neighbors which can contribute to this interaction are j — i ±2. If we use the same parameter 
values for these residues, they dominate this interaction and we get unphysical results. Therefore one usually use 
smaller parameter values for these residues in order to avoid problems. This may seem ad-hoc, but conceptually one 
may justify it by noting that for the residues close in sequence, the quantum effect may become important, which 
modifies the classical interaction parameters. Therefore, we define additional peptide groups /i = 3, 4 for the non- 
proline and proline with j = i±2. The linear parameters to be refined are ascp and bscp, which comprise a total of 
2 • 20 • 4 = 160 parameters. 



4. Disulfide Bridge Energy 



To form disulfide bridges between cysteins, the following energy term is introduced: 

Udis = wais ^ l{D{hi{i'), Hi')) - Do)^ (A7) 

where the i' and hi{i'), h2{i') label the disulfide bridges and the residue numbers forming that bridge, respectively. 
The overall weight w^is is the only linear parameter to be refined for this term. 



5. Torsional Energy 



The twist of the virtual bond between {i — 2)-th and {i — l)-th residues defines the torsion angle ji. Therefore the 
torsion energy for 7; depends on the amino acid types of these residues, which we denote by the integer Ji_2 and 
Jj_i. There are three types of amino acid residues for this interaction, glycine, proline, and the rest, with Jj = 1, 3, 
2, respectively. For the torsion energy between two prolines, that is, when Ji_2 = Ji-i = 3, we have 



3 

Utors{i) = ^{vi{j + 1, 3, 3) cos(j7i) 



+V2{j + 1,3,3) sin(j7i) 

+\vi{j + 1,3,3)1 + \v2{j + 1,3,3)\ 

+ /^i(1.3,3)i±i||^ for-|<7,,;<. 

^lO for-7r<7i<-f ^ ' 

and 

6 

Utora{i) = ^{I'lU^ Ji-2, Ji-l) COs{jji) 
j=l 

+«2(j, ^4-2. Ji-i) sin(j7i) 

+ \Vi{j,Ji-2,Ji-l)\ + \V2{j,Ji-2,Ji-l)\ (A9) 

otherwise. The linear parameters to be refined are Vi{j, k, I), V2{j, k, I) with j = 1, ■ ■ ■, 6, k,l = 1, 2, 3, which comprise 
a total of 108 parameters. 
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6. Local Side-Chain Energy 



This energy is the negative log of a probability distribution, which is given by the sum of Gaussian peaks: 



Uiot{i) = -Wrotlog 



(AlO) 



where Zj^i = Xi — Cj,i. , n(ti) is the number of Gaussian peaks in the distribution which is the fimction of amino acid 
type of the i-th residue U, and Xi — [cot Oi,ai, Pi) for k = 1,2,3, 6i, at, Pi being the bending angle and the polar 
angles of the side-chain, respectively.^ ''"^^ The values of the non-linear parameters bj^n, Cj^a, Gj,i are fixed in this 
work to those used in GASPS, so the only linear parameter to be refined for this term is its overall weight Wrot- 



7. Bending Energy 

The form is similar as the local side-chain energy, except there are two Gaussian peaks for all amino acids types. 
We have 



where 



Uh{i) = -Wfelog 



exp 



{Oi 



k{9c) exp 



(All) 



6c = ai(tj)cos(7j) -I- a2{ti)sin{'yi) 

+bi{ti)cos{'yi+i) + 62(ii)sm(7i+i) 

aciOc)-^ = 2{p3{U)0l +P2{ti)6l+px{ti)ec+Po{ti)f 
+2so(ii) 



k{ec) = exp gi{ti) - 



253 (^^ 



(A12) 

(A13) 
(A14) 



and again the values of the non-linear parameters aj{ti), bj(ti), PjiU), so{ti), and gj(ti) are fixed to those used in 
GASPS, '''^'^'^'^ so the overall weight Wb is the only linear parameter to be optimized for this term. 



8. Multibody Term 

If i-th residue is in contact with j-th residue, this term contributes if (i -|- l)-th residue is also in contact with 
{j + l)-th residue or {j — l)-th residue. In this case, the energy reads 

Uitliocihj) = -Pu{i,3),uii+^,k)H^^[C+E+{^,J)E+{l + l,k) + C.E_{j, i)E_{i + 1, k)] (A15) 

"ij ' i+lk 

where fc = j -I- 1 or fc = j — 1. fi.j is the contact function which is 1 when the distance between the residues is less 
than a given cutoff, and when far way, and a smooth function in the intermediate region. C± are fixed numbers 
which are independent of residue types, and E±{i,j) — E^ ± E''l_ where 

E^{i,j) = [4(1 ± cosaij) + (cosa^ — 3cos/3y cos7y)^ — S(cos/3y ± cos7,j)^)]"^^^ (-^-16) 

and aij,Pij,"fij are the same as in the peptide-peptide interaction. The integer u{i, j) — 1,2,3 when (i,j) pair is 
(non-proline,non-proline), (non-proline, proline) , (proline,proline) respectively. We see that Pu{i,i),u{i+i,k) comprise a 
total of 6 parameters since it is symmetric under the exchange of two indices. 
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9. Total Number of Linear Pcirameters 



Therefore the total number of Hnear parameters we adjust is: 

420 (SC-SC) + 12 (p-p) + 160 (SC-p) 
+ 108 (torsion) + 6 (multi-body) 

+ 1 (bending) + 1 (local side-chain) + 1 (disulfide bridge) = 709 
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Choose the protein with the largest energy gap 
and refine parameters to decrease the gap 



Stop. 

New parameter set {p" 
obtained. 



FIG. 1. The flow chart for the part of the algorithm where the parameters are reflned using linear approximation for the 
energy gaps. 



Initial parameter set {pg} 







Global and Local 
CSA 



No 



Add 




Yes 



Refine parameters using linear approximation 
for energy gaps (Fig.l) 



New parameter set {p} 



Structural 
database 



Stop 



Energy 



Reminimization 



FIG. 2. The flow chart for the whole protocol of the iterative parameter refinement. The parameter refinement step using 
the linear approximation for the energy gap corresponds to the flow chart of Fig. 1. 
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FIG. 3. Plots of the UNRES energy and C" RMSD (from the native structure) for conformations of Ibdd obtained from 
CSA seaxches. (a) The result with the initial parameter set. We observe that the GMEC is of RMSD 3.8 A, and there are 
native-like conformations with RMSDs less than 2.4 A. (b) The result after 3 iterations. In all the figures in this paper, the 
plus signs denote the conformations from the global CSA, and the squares denote those from the local CSA. We observe that 
the GMEC is of RMSD 2.2 A, and there are native- like conformations with RMSDs less than 2.0 A. (c) The final result after 
11 iterations. We observe that the GMEC is of RMSD 1.7 A. 
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FIG. 4. Plots of the UNRES energy and C" RMSD (from the native structure) for conformations of Ifsd obtained from 
global and local CSA searches, (a) The result with the initial parameter set. We observe that the GMEC is of RMSD 5.4 A, 
and there are native-like conformations with RMSDs less than 2.8 A. (b) The result after 3 iterations. We observe that the 
GMEC is of RMSD 2.8 A, and there are native-like conformations with RMSDs less than 2.6 A. (c) The final result after 12 
iterations. We observe that the GMEC is of RMSD 2.5 A. 
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FIG. 5. Plots of the UNRES energy and C" RMSD (from the native structure) for conformations of lejg obtained from 
global and local CSA searches, and the ones in the structural database, (a) The result with the initial parameter set. We 
observe that the GMEC is of RMSD 8.8 A, and there are nativc-likc conformations with RMSDs less than 2.5 A. (b) The final 
result after 13 iterations. We observe that the GMEC is of RMSD 2.6 A. (c) The plot of the conformations in the structural 
bank whose energies are reminimized with the optimized parameter set. 
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FIG. 6. Plots of the UNRES energy and C" RMSD (from the native structure) for conformations of the three proteins 
obtained from global and local CSA searches with the parameter set which is optimized for these proteins simultaneously, (a) 
The plot for Ibdd. We observe that the GMEC is of RMSD 1.8 A. (b) The plot for Ifsd. The GMEC is of RMSD 2.5 A. (c) 
The plot for lejg. The GMEC is of RMSD 2.6 A. 
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(c) 



FIG. 7. (a) The C" trace of Ibdd. The native structure is shown in red and the GMEC found with the optimized parameters 
is shown in yellow. The RMSD is 1.8 A. (b) The C" trace of Ifsd. The native structure is shown in red and the GMEC found 
with the optimized parameters is shown in yellow. The RMSD is 2.5 A. (c) The C" trace of lejg. The native structure is 
shown in red and the GMEC found with the optimized parameters is shown in yellow. The RMSD is 2.6 A. The figures were 
prepared with the program MOLMOL.^^ 
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FIG. 8. The torsional potential between the residues which are neither glycine nor proline, as a function of the torsional 
angle 7. The solid (dashed) line is obtained with the optimized (original) parameters. 
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FIG. 9. The C" trace of the 1-32 segment of the protein Ibba. The figures were prepared with the program MOLMOL.^^ (a) 
The native structure. The residues 15-32 form an a hehx, and there is an extended strand consisting of residues 1-12. (b) The 
conformation with the lowest RMSD found with the optimized parameters. The RMSD is 5.8 A. We observe that the a helix 
at the C-tcrminal is partially formed, consisting of residues 15-19 and 21-32. (c) The GMEC with the optimized parameters. 
The RMSD is 7.9 A. Again the a-helix is correctly formed except at residue 20. (d) The lowest RMSD conformation found with 
the original parameters. The RMSD is 6.2 A. The position of the a helix is shifted, to 9-12, 14-24 and 27-29. (e) The GMEC 
with the original parameters, with RMSD 7.6 A. Again, we find a helices are formed at wrong positions, 10-13 and 15-24. 



TABLE I. The values of RMSD cutoffs used for local CSA searches and the parameter refinements for the case of separate 
optimizations (units in A). The integer i denotes the i-the iteration of CSA search, and i ^ i + 1 denotes the parameter 
refinement step from i-th. to i -|- 1-th iteration. 
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12 13 
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2.5 


° The local CSA was 


not carried out because the global CSA was enough to find native-like conformations. 


^ These values of the RMSD cutoff are used sequentially during 


the parameter refinement. 




TABLE II. The values of RMSD cutoff used for local CSA searches and the parameter refinements for the 


case of simultaneous 


optimizations (units in 


A). 
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" The initial conformational search is not necessary since we use the stnictural databases accumulated from the 
separate optimizations of three proteins. 

^ These values of the RMSD cutoff are used sequentially during the parameter refinement. 
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